STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Neural Information Processing Systems

While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects; e.g., the sound of footsteps comes from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array together with audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. Sound scenes in STARSS23 are recorded under instructions that guide participants to ensure adequate activity and occurrences of target sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on the tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
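For context on the DOA labels described above, SELD pipelines commonly convert an azimuth/elevation annotation into a Cartesian unit vector before training or evaluation. The following is a minimal sketch of that conversion; the function name and the axis convention (x front, y left, z up) are assumptions for illustration, not necessarily the exact convention used by STARSS23 tooling:

```python
import numpy as np

def doa_to_unit_vector(azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Convert a DOA given as azimuth/elevation in degrees to a Cartesian unit vector."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    # Assumed convention: x points to the front, y to the left, z up.
    return np.array([
        np.cos(el) * np.cos(az),
        np.cos(el) * np.sin(az),
        np.sin(el),
    ])

# A source straight ahead at the horizon maps to the x axis.
print(doa_to_unit_vector(0.0, 0.0))
```

The inverse mapping (vector back to angles) uses `arctan2` for azimuth and `arcsin` of the z component for elevation.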


Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes

Roman, Adrian S., Balamurugan, Baladithya, Pothuganti, Rithik

arXiv.org Artificial Intelligence

This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network.

For this reason, the sound localization performance strongly depends on the video content [10]. This makes models prone to erroneous SELD on frames with no audio or uncorrelated audio activity. We introduce a visual branch into the audio-only SELDnet23 baseline from the Classification of Acoustic Scenes and Events (DCASE) challenge.
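The fusion step described in this abstract (merging audio and video information prior to the GRU) can be sketched roughly as follows. All shapes, feature names, and the simple frame-wise concatenation here are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features: audio embeddings from a CNN front end,
# and visual embeddings (e.g., encoded object positions) from a video branch.
T, AUDIO_DIM, VIDEO_DIM = 50, 128, 32
audio_feats = rng.standard_normal((T, AUDIO_DIM))
video_feats = rng.standard_normal((T, VIDEO_DIM))

# Audio-visual fusion: frame-wise concatenation before the recurrent layer.
fused = np.concatenate([audio_feats, video_feats], axis=-1)  # shape (T, 160)

# A toy single-layer GRU pass over the fused sequence (hidden size 64,
# random weights, no biases) to show where the merged features flow.
H = 64
Wz, Wr, Wh = (rng.standard_normal((AUDIO_DIM + VIDEO_DIM + H, H)) * 0.01
              for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.zeros(H)
for x in fused:
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz)                                  # update gate
    r = sigmoid(xh @ Wr)                                  # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)    # candidate state
    h = (1 - z) * h + z * h_tilde
```

In a real SELD network the GRU output would feed the detection and DOA regression heads; the point of the sketch is only that the recurrent layer sees a joint audio-visual representation rather than audio features alone.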